Filtering

As someone interested in business and finance reporting, I wanted to use a dataset that alligned with such interests. Taking into account the current pandemic we are living through, I thought it would be interesting to examine the relief given to businesses through the Small Business Administration’s Paycheck Protection Program (PPP) loans. The dataset I used can be located here: link. The dataset consists of 200,000 entries of PPP loan approvals in the state of Illinois from early April to early August. The dataset includes information on city, zipcode, loan amount, loan approval date, jobs reported, gender, race/ethnicity, lender, non-profit status, veteran status, business type, and NAISC code. As the dataset was incredibly large, I decided focusing on specifically Chicago would allow for more concentrated insight and interview material. The following code shows the manipulation I did to reduce the dataset into just Chicago. As not included data was filled as “Unanswered,” I needed to not only refine the dataset but also put it in a form that would recognize “Unanswered” as NA data.

#read in large IL dataset
PPP_Data <- read_csv("PPP Data up to 150k 080820 IL.csv")
#filtering into just the city Chicago
PPP_chicago <- PPP_Data %>%
  filter(City == c("Chicago", "CHICAGO")) # %>%
 # replace_with_na_all(condition = ~.x == "Unanswered")
 # commented this out for sake of loading my code because the unanswered changes are v tedious

After looking at some graphs in Tableau, I realized that some that were labeled Chicago in fact were not in the Chicago area as the zip placed them in other parts of Illinois. Thus, I decided to instead filter by zipcode. In viewing this data in R, I realized that there was some error on the part of the SBA as two observations had congressional districts that were not in IL. I decided to remove these two observations. This dataset has 49,611 observations which was a little less intimidating than the original amount.

PPP_Chicago <- PPP_Data %>%
  filter(Zip %in% c(60007, 60018, 60068, 60106, 60131, 60176, 60601, 60602, 60603, 60604, 60605, 60606, 60607, 60608, 60609, 60610, 60611, 60612, 60613, 60614, 60615, 60616, 60617, 60618, 60619, 60620, 60621, 60622, 60623, 60624, 60625, 60626, 60628, 60629, 60630, 60631, 60632, 60633, 60634, 60636, 60637, 60638, 60639, 60640, 60641, 60642, 60643, 60644, 60645, 60646, 60647, 60649, 60651, 60652, 60653, 60654, 60655, 60656, 60657, 60659, 60660, 60661, 60706, 60707, 60714, 60804, 60827)) %>%
  filter(CD != "IA-02", CD != "OH-15") %>%
  replace_with_na_all(condition = ~.x == "Unanswered")

I then realized that the NAICS code, which is a government distinction between different types of businesses, while interesting had 881 unique codes in the dataset. Thus, I decided to trim down the NAICS code into the two-digit industry code of which there are only 24. I used an inner join to attach the industry names to the dataset.

PPP_Chicago$NAICSCat <- strtrim(PPP_Chicago$NAICSCode, 2)
NAISC <- read_excel("NAISC.xlsx", col_types = c("text", 
    "text"))
PPP_Chicago <- PPP_Chicago %>%
  inner_join(NAISC, by = "NAICSCat")

In order to analyze this in Tableau, I needed to write out a text file.

write.table(PPP_Chicago, file = "PPP_Chicago.txt", sep = "\t", row.names = FALSE)

Questions

I had a few guiding questions that I wanted to think about going in: (1) Are certain groups of individuals likely to get more money from the Paycheck Protection Program? (2) Are certain lenders more prevalent in certain areas? Which banks handled the most money? Did more local lenders seem to prefer to lend to certain demographics or areas? How did this differ from the US population? (3) How did the type of business affect the money recieved? ## Some Preliminary Charts The first thing I was curious about was some general information on broad numbers dealing with PPP loans.

# total dollar amount of loans
sum(PPP_Chicago$LoanAmount)
## [1] 1397429694
# total amount of loans approved
nrow(PPP_Chicago)
## [1] 46472
# average loan granted
mean(PPP_Chicago$LoanAmount)
## [1] 30070.36
# median loan granted
median(PPP_Chicago$LoanAmount)
## [1] 20000
# a box plot of the loan data, the vast amount of outliers makes it difficult to see the stopping point of the right whisker, tells us that average will be somewhat skewed to the higher numbers
ggplot(PPP_Chicago, aes(x = LoanAmount)) +
  geom_boxplot(width = .5) +
  ylim(-1,1) +
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.x = element_line(color = "lightgrey"),
        axis.line.x = element_line(color = "black"))

Tableau makes line graphs much easier so I was curious about a few observations about loans over time. (1) How many loans were being approved over time (2) How much was being given out in loans over time. The shapes tell us that no particular day seemed to have a greater amount of higher loans than others as the shapes are generally similar.

Caption Caption

I wanted to do some preliminary analysis into some distributions (numerically) of different variables to try and answer my first question. The ones I thought would be very interesting, gender and race/ethnicity, unfortunately caused a bit of trouble. Some intial tables showed me that most observations did not include such data. Only 5% of observation included race/ethnicity data and 12% included gender. Thus, while I can use other data and compare to demographic data of Chicago in general (i.e. what are average loan amounts by zip code in comparison to the zipcodes average income or in comparison to the Black and Brown population proportions). This is something I considered as I decided how to build my final visualizations.

kable(PPP_Chicago %>%
  count(RaceEthnicity))
RaceEthnicity n
American Indian or Alaska Native 11
Asian 513
Black or African American 268
Hispanic 340
Puerto Rican 1
White 1396
NA 43943
kable(PPP_Chicago %>%
  count(Gender))
Gender n
Female Owned 1473
Male Owned 4682
NA 40317

To start on this final visualization, I made a map on Tableau that shows the average loan amount by zip code.

Caption

Caption

After some brief looks into those criteria, I began to look at some of the information that was more complete: lender, jobs reported, business type, and industry codes.

# making table with percentages of loans lenders were in control of (for comparison to national patterns) as well as counts, for sake of space I decided to only put lenders who loaned to more than 10 businesses
x <- PPP_Chicago %>%
  count(Lender)
kable(x %>%
  filter(n > 10) %>%
  mutate(pct = n/nrow(PPP_Chicago)*100) %>%
  arrange(desc(pct)))
Lender n pct
Cross River Bank 6624 14.2537442
Kabbage, Inc. 5683 12.2288690
JPMorgan Chase Bank, National Association 5479 11.7898950
Celtic Bank Corporation 2637 5.6743846
Bank of America, National Association 2539 5.4635049
Fifth Third Bank 2195 4.7232742
U.S. Bank, National Association 2004 4.3122741
WebBank 1392 2.9953520
BMO Harris Bank National Association 1317 2.8339645
Byline Bank 1075 2.3132209
First Midwest Bank 890 1.9151317
Wintrust Bank, National Association 778 1.6741264
PNC Bank, National Association 757 1.6289379
Citibank, N.A. 614 1.3212257
TCF National Bank 600 1.2911000
MBE Capital Partners 497 1.0694612
Lakeside Bank 440 0.9468067
International Bank of Chicago 401 0.8628852
Customers Bank 397 0.8542778
CIBC Bank USA 354 0.7617490
First American Bank 322 0.6928903
The Huntington National Bank 303 0.6520055
First Secure Bank and Trust Co. 253 0.5444138
Readycap Lending, LLC 248 0.5336547
Radius Bank 229 0.4927698
Beverly Bank & Trust Company, National Association 228 0.4906180
Intuit Financing Inc. 227 0.4884662
Signature Bank 212 0.4561887
Fundbox, Inc. 211 0.4540368
Park Ridge Community Bank 211 0.4540368
Providence Bank & Trust 202 0.4346703
Village Bank and Trust, National Association 196 0.4217593
Bank of Hope 190 0.4088483
Parkway Bank and Trust Company 190 0.4088483
Republic Bank of Chicago 190 0.4088483
Self-Help FCU 181 0.3894818
Associated Bank, National Association 180 0.3873300
Union National Bank 157 0.3378378
CRF Small Business Loan Company, LLC 154 0.3313823
Belmont Bank & Trust Company 136 0.2926493
Millennium Bank 131 0.2818902
Marquette Bank 124 0.2668273
First Eagle Bank 120 0.2582200
The Northern Trust Company 111 0.2388535
West Suburban Bank 107 0.2302462
Town Center Bank 106 0.2280943
Hinsdale Bank & Trust Company, National Association 103 0.2216388
Itasca Bank & Trust Co. 103 0.2216388
Cornerstone National Bank & Trust Company 99 0.2130315
Lake Forest Bank & Trust Company, National Association 82 0.1764503
Wells Fargo Bank, National Association 80 0.1721467
Newtek Small Business Finance, Inc. 78 0.1678430
American Express National Bank 77 0.1656912
Albany Bank and Trust Company, National Association 76 0.1635393
Selfreliance FCU 75 0.1613875
Busey Bank 72 0.1549320
First Secure Community Bank 72 0.1549320
Northbrook Bank & Trust Company, National Association 72 0.1549320
Northeast Bank 72 0.1549320
BankFinancial, National Association 71 0.1527802
First Home Bank 68 0.1463247
Glenview State Bank 68 0.1463247
Royal Savings Bank 68 0.1463247
Old Second National Bank 66 0.1420210
Devon Bank 63 0.1355655
The First National Bank of Ottawa 62 0.1334137
Inland Bank and Trust 61 0.1312618
Heartland Bank and Trust Company 60 0.1291100
Burling Bank 58 0.1248063
Gold Coast Bank 58 0.1248063
Libertyville Bank & Trust Company, National Association 58 0.1248063
Schaumburg Bank & Trust Company, National Association 56 0.1205027
Harvest Small Business Finance, LLC 52 0.1118953
Midwest Community Bank 51 0.1097435
Forest Park National Bank and Trust Company 48 0.1032880
Hanmi Bank 48 0.1032880
Waterman State Bank 47 0.1011362
Itria Ventures LLC 46 0.0989843
Live Oak Banking Company 46 0.0989843
First Bank of Highland Park 45 0.0968325
Peoples Bank 43 0.0925288
CDC Small Business Finance Corporation 42 0.0903770
American Metro Bank 41 0.0882252
FC Marketplace, LLC (dba Funding Circle) 40 0.0860733
Grand Ridge National Bank 40 0.0860733
BSD Capital, LLC dba Lendistry 35 0.0753142
CNB Bank and Trust, National Association 35 0.0753142
Old Plank Trail Community Bank, National Association 35 0.0753142
American Community Bank & Trust 34 0.0731623
First Bank of the Lake 34 0.0731623
Silicon Valley Bank 34 0.0731623
Evergreen Bank Group 33 0.0710105
First Bank and Trust Company of Illinois 32 0.0688587
Centier Bank 29 0.0624032
Oxford Bank & Trust 29 0.0624032
Wheaton Bank & Trust Company, National Association 29 0.0624032
CenTrust Bank, National Association 28 0.0602513
Great Lakes CU 28 0.0602513
Midland States Bank 28 0.0602513
St. Charles Bank & Trust Company, National Association 28 0.0602513
Capital One, National Association 27 0.0580995
Fountainhead SBF LLC 27 0.0580995
North Side Federal Savings & Loan Association of Chicago 26 0.0559477
Greater Nevada CU 24 0.0516440
Merchants and Manufacturers Bank 24 0.0516440
Pacific City Bank 23 0.0494922
1st Equity Bank 22 0.0473403
Allied First Bank 22 0.0473403
Community Bank of Elmhurst 22 0.0473403
Citizens Bank, National Association 21 0.0451885
First Bank of Manhattan 21 0.0451885
The Leaders Bank 21 0.0451885
Transportation Alliance Bank, Inc. d/b/a TAB Bank, Inc. 21 0.0451885
Brickyard Bank 20 0.0430367
Cache Valley Bank 20 0.0430367
Citizens Bank of Chatsworth 20 0.0430367
Crystal Lake Bank & Trust Company, National Association 20 0.0430367
United Fidelity Bank, FSB 20 0.0430367
American Eagle Bank 19 0.0408848
FNBC Bank and Trust 19 0.0408848
Algonquin State Bank 18 0.0387330
Bank of the West 18 0.0387330
Barrington Bank & Trust Company, National Association 18 0.0387330
Citizens Bank 18 0.0387330
La Salle State Bank 18 0.0387330
Legacy Bank 18 0.0387330
Notre Dame FCU 18 0.0387330
Bank of Pontiac 17 0.0365812
Idaho First Bank 17 0.0365812
The Bancorp Bank 17 0.0365812
American Community Bank of Indiana 15 0.0322775
First Merchants Bank 15 0.0322775
Navy FCU 15 0.0322775
Amalgamated Bank of Chicago 14 0.0301257
Cathay Bank 14 0.0301257
Hometown National Bank 14 0.0301257
BankUnited, National Association 13 0.0279738
Blue Ridge Bank, National Association 13 0.0279738
Hiawatha National Bank 13 0.0279738
Horizon Bank 13 0.0279738
McHenry Savings Bank 13 0.0279738
NorthSide Community Bank 13 0.0279738
The State Bank 13 0.0279738
Baxter CU 12 0.0258220
Carrollton Bank 12 0.0258220
First National Bank of Omaha 12 0.0258220
First Republic Bank 12 0.0258220
Stearns Bank National Association 12 0.0258220
United Community Bank 12 0.0258220
Northwest FCU 11 0.0236702
Princeville State Bank 11 0.0236702

I wanted to explore lenders more. I found some data from how nationally loans were dispersed by lender. Caption Though some of the big actors were also big in Chicago, there were a few notable differences. For instance, the second largest number of loans came from Kabbage, Inc. in Chicago, while it did not break the top 15 nationally. Additionally, it appears that some of the larger lenders on the national list had even more authority in the Chicago area. While JPMorgan Chase loaned 4.4% of loans nationally, they handled almost 12% of all PPP loans from this first round in the Chicago area. Some of the banks with bigger shares of total authority that did not appear on the national list appear to be Chicago/Midwest based: Byline Bank, First Midwest Bank, TCF National Bank (more midwest base), Lakeside Bank, and International Bank of Chicago. I was interested to see if these banks seemed to lend to certain areas more than the other larger banks. Businesses in the north side seemed to have access to more lenders than those on the south side, in general. Caption In comparison, here is a map of how many loans were dispersed throughout Chicago: Caption What was very intersting was that though there were a large amount of loans dispersed 60620, 60619, and 60628, there was less diversity in lenders. This led me to another question: who was dominating lending in these areas?

I was interested in seeing how this broke up by the Chicago/Midwest based big lenders versus the big 15 that nationally had a lot of the total loan authority. I realized after looking at the visualizations that one of the reasons why these maps might look so similar was because the number of lenders in the smaller percentage areas had more lenders that probably diluted the authority of these two groups I looked at. Caption Caption I decided to go back to my original question of how lenders by individual lender vary by zip code. I needed to do some maneauvering to obtain a dataset that gave me for each zipcode the lender with the most loans. I thought this might provide some insight that the previous two visualizations couldn’t offer.

y <- PPP_Chicago %>%
  group_by(Zip) %>%
  count(Lender)
y <- y %>%
  group_by(Zip) %>%
  mutate(BigLend = max(n)) %>%
  filter(n == BigLend)
write.table(y, file = "MaxLender.txt", sep = "\t", row.names = FALSE)

Caption This graphic was so interesting to me, and I knew I wanted to use it in my story. The divide between the North and South is fairly stark as the lenders seem segregated to certain areas. I decided I wanted to look at this in my story and do some research into the marketing tactics of specifically Kabbage, INc. and Cross River Bank. I also wanted to see if they had any programs that were focused on these areas.

With this in mind, I shifted my attention to business type.

# making a table looking at the specifics of business types
kable(PPP_Chicago %>%
  group_by(BusinessType) %>%
  summarise(avgloan = mean(LoanAmount),
            totalloans = sum(LoanAmount),
            medloan = median(LoanAmount),
            avgsize = mean(JobsReported, na.rm = TRUE)) %>%
  arrange(desc(totalloans)))
BusinessType avgloan totalloans medloan avgsize
Corporation 38859.99 501915653.0 25200.00 6.100379
Limited Liability Company(LLC) 38615.10 303939472.3 22959.00 5.183796
Subchapter S Corporation 36817.15 234414770.5 22940.00 4.851329
Sole Proprietorship 16750.05 168137018.7 20000.00 2.394313
Self-Employed Individuals 15395.19 70232855.2 18750.00 1.258774
Non-Profit Organization 45098.47 50600482.9 33763.75 7.036630
Independent Contractors 12066.85 34414669.9 12500.00 1.173048
Partnership 40225.87 9855337.6 29358.77 4.447257
Limited Liability Partnership 51303.60 8721612.8 42888.50 7.503106
Professional Association 38493.65 6736389.4 24988.00 6.294118
NA 55263.47 4476341.0 47400.00 6.312500
Cooperative 53391.43 3523834.1 40250.00 5.906250
Non-Profit Childcare Center 54370.20 271851.0 52220.00 7.600000
Joint Venture 94702.62 189405.2 94702.62 10.000000

The chart was particularly interesting to me as the business type seemed to make a big difference to how much in loans one was able to get. I created the following visualization to try and gain some insight. I first created a box plot that showed each business type and the loan distributions. Caption I was aware that size of the company might be a factor among why certain groups (sole proprieters, individual contractors, and self-employed people) may have lower loan averages/medians. Thus, I decided to look at some regression lines on scatterplots of job size and loan amount, grouped by business type. I removed any extreme outliers from the data to avoid any values that would vastly change the regression. I also removed to categories that had very few points: non-profit childcare center and joint venture. Caption

The regression trend lines are much flatter for independent contractors, sole proprietors, and self-employed individuals. Thus, even when additional jobs were added in those business types, the impact of another person on payroll increased the average loan amount recieved by much less than corporations and LLCs. I was interested in this question particularly: what made these groups less likely to recieve more money. In my story, I explore this using visualizations as well as outside reporting context through Lotika Pai, managing director at the Women’s Business Development Center Access to Capital group.

Looking at business type made me start to think about how such differences look at an industry level.

# making a table looking at the specifics of industries
kable(PPP_Chicago %>%
  group_by(Industry) %>%
  summarise(avgloan = mean(LoanAmount),
            totalloans = sum(LoanAmount),
            medloan = median(LoanAmount),
            avgsize = mean(JobsReported, na.rm = TRUE)) %>%
  arrange(desc(totalloans)))
Industry avgloan totalloans medloan avgsize
Professional, Scientific, and Technical Services 32591.16 229409183.4 20805.00 2.851852
Other Services (except Public Administration) 26176.60 159101399.2 19500.00 3.945164
Health Care and Social Assistance 37053.64 150326602.6 21670.20 4.990464
Accommodation and Food Services 40881.86 150240845.9 25535.00 9.580398
Retail Trade 26617.13 107586454.0 19253.00 4.612694
Transportation and Warehousing 17645.84 80500307.6 14090.09 2.185086
Construction 29392.66 76891209.7 19226.00 3.289346
Real Estate Rental and Leasing 29552.85 76187241.4 20332.00 3.257446
Manufacturing 46417.04 73153253.2 30637.50 5.694133
Finance and Insurance 26345.81 72108487.7 20000.00 2.159023
Wholesale Trade 42320.83 58106499.5 24410.50 4.847432
Administrative and Support and Waste Management and Remediation Services 28277.99 51918391.9 19137.00 4.392576
Arts, Entertainment, and Recreation 22546.42 32399208.6 16582.00 4.932692
Information 34496.39 26665709.1 20508.00 3.332875
Educational Services 27889.66 26327838.9 19499.50 5.014706
Agriculture, Forestry, Fishing and Hunting 20512.29 17168787.5 20000.00 6.666667
Public Administration 33942.78 4039190.9 20833.00 3.267241
Management of Companies and Enterprises 26430.88 3964632.5 20000.00 5.586466
Mining 36108.51 686061.8 20833.00 4.578947
Utilities 27016.18 648388.3 18929.50 3.000000

What was interesting about this table was that professional, scientific, and technical services had the largest amount of loans (in dollars) given to them but had a lower loan average than a lot of other industries. Had I had more time for reporting I would be interested to see exactly what each classification means/consists of in Chicago (perhaps talking to some local small business focused resource centers). I decided to make a similar box plot to the business types box plot to again look at the distribution.

Caption The transportatio nand warehousing industry seemed to have the lowest median loan amount while still having a vast array of outliers. The industries were also something I wanted to explore in my story. I made to bar charts whose information I wanted to combine into one: Caption Caption Looking at the two, the information is valuable in each. However, layering the average loans over the total will be difficult because of how much higher the sum of all loans is in comparison to the individual. As a result, I was deciding how to display such information and realized that perhaps average loan amount and count gives a lot of valuable information. In addition, having a median line to show median as there are so many outliers may also be helpful.

I decided that a separate graph that perhaps shows the makeup of all the PPP loans would perhaps fit better. Since some portions are incredible small or were not given, I decided to create an other category that encompassed those categories: null, management of companies, mining, public administration, and utilities. Though pie charts make it difficult to detect small differences, I wanted some way to show the composition of PPP loans by industry as a part of the whole.

Building the Maps

Because we were encouraged to host the final piece on a website, I decided to use ArcGIS’s Storymap software to host both my story and to enable me to add interactivity to my visualization. A lot of the most interesting data I found was through looking at maps thus I decided to use the ArcGIS interface to create interactive maps alongside my narrative.

The first challenge was seeing how to combine existing data from ArcGIS online into my dataset as unlike Tableau, ArcGIS requires shapefiles in order to draw zipcode boundaries. I played around, attempting to merge layers which proved unsuccessful but through use of a join features command, I was able to add my data to existing shapefile data.

Using this technique, I was able to add my information I wanted. However, because the dataset was quite large, I had to create individual datasets that included the map data I was interested in.

PPP_condensed <- PPP_Chicago %>%
  group_by(Zip) %>%
  summarise(avgloan = mean(LoanAmount),
            totalloans = sum(LoanAmount),
            medloan = median(LoanAmount),
            avgsize = mean(JobsReported, na.rm = TRUE),
            lenders = n_distinct(Lender),
            loans = n())
write.table(PPP_condensed, file = "PPP-small.txt", sep = "\t", row.names = FALSE)